Phylo-mLogo: An interactive multiple-logo visualization tool for large-number sequence alignments
Abstract
When aligning several hundred or several thousand sequences, such as those of HIV, dengue virus, and influenza viruses, to reconstruct the epidemiological history or to understand the mechanisms of epidemic virus evolution, analyzing and visualizing the resulting large-number alignments has become a new challenge for computational biologists. Although several tools are available for visualizing very long sequence alignments, few of them are applicable to large-number alignments. In this paper, we present a multiple-logo alignment visualization tool, called Phylo-mLogo, which allows the user to visualize the global profile of a whole multiple sequence alignment and to hierarchically visualize the homologous logos of each clade simultaneously. Phylo-mLogo calculates the variabilities and homogeneities of the aligned sequences by base frequencies or entropies. Unlike traditional representations of sequence logos, Phylo-mLogo displays not only the global logo patterns of the whole alignment but also the local logos for each clade. In addition, Phylo-mLogo allows the user to focus on the analysis of important structurally or functionally constrained sites in the alignment, selected either by the user or by built-in automatic calculation. With Phylo-mLogo, the user can symbolically and hierarchically visualize hundreds of aligned sequences simultaneously and easily check the sites of amino acid changes when analyzing large numbers of human or avian influenza virus sequences.

INTRODUCTION

Epidemic viruses, such as human immunodeficiency virus (HIV), influenza viruses, and dengue virus, continuously pose threats to human health, especially the recent outbreak of H5N1 avian influenza virus infection in humans, which has caused death in over 50% of the 218 cases confirmed by the WHO ( http://www.who.int/csr/disease/avian_influenza/country/en/index.html ) (Moya et al. 2004). 
Therefore, it is important and urgent for biologists to understand the mechanisms of epidemic virus evolution and the epidemiological history. In addition to the exponentially growing virus data available in GenBank, two ongoing large-scale sequencing projects of human and avian influenza viruses have released a number of complete virus genomes (Ghedin et al. 2005; Obenauer et al. 2006) and led to several significant studies (Campitelli et al. 2006; Holmes et al. 2005; Obenauer et al. 2006). Unlike the sequences used to identify conserved regions in comparative genomics, the sequences analyzed for epidemiology are usually much shorter and more conserved, and their number can range from several hundred to several thousand. For example, Holmes et al. (2005) performed a phylogenetic analysis of 156 human H3N2 influenza A viruses and observed multiple co-circulating clades. Campitelli et al. (2006) analyzed 685 human and avian sequences and found that viral genes appeared to be under strong purifying selection, with only the PB2, HA and NS1 genes under positive selection. Moreover, Obenauer et al. (2006) compared 4339 avian influenza virus genes and identified several previously unknown clades. Therefore, aligning large numbers of virus sequences can help researchers identify important polymorphic sites between different lineages and uncover the evolutionary histories and mutation trends of influenza viruses. Sequence alignment followed by phylogenetic inference is a standard procedure for analyzing virus sequences ( http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html ). Based on the reconstructed phylogenetic relationships, the evolutionary histories of epidemic viruses can be inferred. 
Traditionally, researchers assign numbers to all clades in the phylogenetic analysis of individual gene segments and use them to represent and compare genotypes across multiple viruses. However, when the number of analyzed sequences is in the hundreds, this approach cannot distinguish the differences between viral sequences from different strains (Obenauer et al. 2006). Moreover, the evolutionary changes at specific sites, such as antigenic sites, cannot be observed directly by global phylogenetic analysis without checking the detailed alignment results. Therefore, providing efficient tools for biologists to analyze and visualize large-number sequence alignments of viruses has become a challenge for computational biologists. In recent years, several visualization tools for sequence alignments have become available in the public domain. Based on their visualization output, these tools can be divided into two categories: curve-based and sequence-logo-based. In the former category, tools such as the VISTA family (Shah et al. 2004), PipMaker (Schwartz et al. 2003; Schwartz et al. 2000), zPicture (Ovcharenko et al. 2004), and SinicView (Shih et al. 2006) were developed either to visualize individual alignment results or to compare and evaluate alignment results obtained by different tools. These tools are useful for visualizing very long alignments of a few sequences. However, for large-number alignments of short sequences they are impractical, because significant variations between sequences may be submerged by global scoring profiles, which are calculated from either identity rates or sum-of-pairs scores. In the latter category, sequence logos graphically represent the informative patterns of each individual site in a multiple sequence alignment. Thus, sequence logos can help users discover and identify conserved patterns in multiple sequence alignments (Schneider and Stephens 1990). 
Sequence logos were first proposed by Schneider and Stephens (1990). Fourteen years later, Crooks et al. (2004) developed an extension, called WebLogo, that incorporates additional features and options. To distinguish gaps from poorly conserved positions, LogoBar (Perez-Bercoff et al. 2006) was proposed to display protein sequence logos that include not only amino acids but also gaps. These logo-based tools are very useful for globally visualizing consensus patterns in a multiple sequence alignment. However, when the number of aligned sequences reaches several hundred or several thousand, some significant local mutation tendencies cannot be observed directly from these global logo-based profiles. In the analysis of influenza virus evolution, tracking the transitional changes of the amino acids at the epitope or receptor binding sites is very important because these changes could cause antigenic drift (Bush et al. 1999), affect viral transcription (Gabriel et al. 2005), and contribute to mammalian adaptation (Subbarao et al. 1993). Furthermore, since viral mutation rates are much higher than those of eukaryotes (Li 1997), observing the dynamic evolutionary transitions of viruses can help researchers examine the functional and evolutionary characteristics of influenza. In this paper, we present a multiple-logo alignment visualization tool, called Phylo-mLogo, which allows the user to visualize the global profile of the whole multiple sequence alignment and to hierarchically visualize the homologous logos of each clade simultaneously. Phylo-mLogo calculates the variabilities and homogeneities of the aligned sequences by base frequencies or entropies. Unlike traditional representations of sequence logos, Phylo-mLogo displays not only the global logo patterns of the whole alignment but also the local logos for each clade. 
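To make the entropy-based variability measure concrete, the following is a minimal sketch, not the tool's actual implementation. It assumes plain Shannon entropy per alignment column, treats a gap as an ordinary symbol, and uses a toy alignment with hypothetical clade labels (`cladeA`, `cladeB`); the function names are illustrative only.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Assumption: every character, including a gap, counts as a symbol.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(sequences):
    """Entropy of each column for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment grouped by hypothetical clade labels (not data from the paper).
clades = {
    "cladeA": ["MKTP", "MKTP", "MKSP"],
    "cladeB": ["MRTS", "MRTS", "MRTS"],
}

# Global profile over all sequences, and one local profile per clade.
all_sequences = [seq for seqs in clades.values() for seq in seqs]
global_profile = entropy_profile(all_sequences)
clade_profiles = {name: entropy_profile(seqs) for name, seqs in clades.items()}
```

In this toy case the second column is variable in the global profile (K in one clade, R in the other) yet fully conserved within each clade, which is the kind of clade-specific signal a single global logo would hide.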
In addition, Phylo-mLogo allows the user to focus on the analysis of important structurally or functionally constrained sites in the alignment, selected either by the user or by built-in automatic calculation. With Phylo-mLogo, the user can symbolically and hierarchically visualize hundreds of aligned sequences simultaneously and easily check the sites of amino acid changes when analyzing large numbers of human or avian influenza virus sequences.

RESULTS

In influenza viruses, the surface glycoprotein hemagglutinin (HA) is the most important target for the human immune system. Recent studies reveal that modifications of HA1, the immunogenic part of HA, accrue at a dramatic rate and also indicate that HA1 is undergoing diversifying or positive selection (Bush et al. 1999; Fitch et al. 1991; Plotkin and Dushoff 2003). Because the HA1 genes mutate so fast, new variant strains of H3N2 tend to replace older ones quickly and cause annual outbreaks. Thus, identifying the sites under selection and their mutation trends in the HA genes is very important (Bush et al. 1999). In what follows, we introduce two examples from the study of influenza HA genes to demonstrate how Phylo-mLogo can help users observe and analyze large-number sequence alignment results. The numbers of aligned sequences in the two examples are 453 and 207, respectively. The aligned sequences come from avian and human influenza viruses, respectively.

Example 1: 453 avian influenza HA genes

The spread of H5N1 avian influenza from China to Europe has raised global concern about its potential to infect humans and cause a pandemic. A more comprehensive collection and analysis of avian influenza sequences is critically needed for biologists and epidemiologists to determine the virulence of these viruses and their transmissibility from avian species to humans. Thus, Obenauer et al. 
(2006) established the first large-scale sequencing effort to collect additional genomic data on the avian population of influenza A viruses. They introduced a proteotyping method to identify and number unique amino acid signatures, called proteotypes, for sequences that may or may not be distinguished by branches on a phylogenetic tree. They analyzed eight avian influenza genes and provided proteograms to demonstrate the amino acid signatures within each clade (Figures S2-S9 in the supplementary material of Obenauer et al. 2006). Based on these observations, they concluded that the virus families tend to have multiple core conserved genes and that the surface glycoproteins, HA and NA, appear to be more freely exchanged than core proteins because of immune pressure (Obenauer et al. 2006). In this part, we downloaded 437 avian influenza HA genes used for analysis in Obenauer et al. (2006). Because inferring the phylogenetic tree of these sequences with MrBayes is very time consuming (Obenauer et al. 2006; Ronquist and Huelsenbeck 2003), we read the tree shown in Fig. S6 of Obenauer et al. (2006) directly and reconstructed the phylogenetic relationships manually. The proteotypes of the analyzed sequences include p1.1, p2.1, p5.1-4, p6.1-6, p8.1, p9.1, p9.2, and p12.1. Based on these proteotypes, we first aligned the sequences within each proteotype and then aligned the proteotypes together, using ClustalW (Thompson et al. 1994). The total alignment length is 584. Figure 2A shows the sequence logos and their phylogenetic tree simultaneously. Unlike other tools for tree visualization (http://www.genetics.wustl.edu/eddy/atv/), Phylo-mLogo displays the phylogenetic tree in the style of a standard file browser, because this representation is more compact than the traditional visualization of the original phylogenetic tree shown in Fig. 2B. 
Thus, the user can click on different clades, shown with yellow background colors, in the same way as selecting folders in a file browser, to visualize the sequence logos of the alignment at different levels. Stevens et al. (2006) listed some conserved residues within the receptor binding domains of the H1 and H5 serotypes that are implicated in receptor specificity, namely amino acid positions 183, 190, 193, 194, 216, 221, 222, and 225-228, whose corresponding positions in our example are 205, 212, 215, 216, 238, 243, 244, and 247-250, respectively. We then compared the sequence logos at these sites between different proteotypes. To avoid confusion, we use the original positions given in Stevens et al. (2006) in the following discussion. As shown in Fig. 2C, the amino acids at residue sites 194, 225, and 228 are almost conserved across the H1, H2, H5, H6, H8, H9, and H12 serotypes. If we consider only H1, H2, and H6, which fall in the same clade as H5 (Glaser et al. 2005; Obenauer et al. 2006), the amino acids at sites 183, 190, 194, 225, 226, and 228 are almost the same across these serotypes. Interestingly, we found that the majority residue at site 221 of proteotype p5.3 is S rather than the P found in the reference avian strain, A/Duck/Singapore/Q-F119-3/1997 (H5 serotype), used for comparison in Stevens et al. (2006), while 221S has been fixed in the human H5N1 strain, A/Vietnam/1203/2004. Since all sequences in p5.3 belong to avian H5N1, it appears that 221S (a polar amino acid) has almost taken over from 221P (a non-polar amino acid) in the avian H5N1 population. At residue 216, the amino acids K, E, and R are all found in proteotype p5.3, while the amino acids are 216E and 216R for A/Duck/Singapore/Q-F119-3/1997 and A/Vietnam/1203/2004, respectively. Compared with H1, H2, and H6, 216K seems to be an advantageous mutation and may eventually become fixed at this site in the avian H5N1 strains. 
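The correspondence between the Stevens et al. (2006) numbering and the alignment columns quoted above can be sketched as follows. This is an illustration under stated assumptions: the constant offset of 22 is inferred only from the position pairs listed in the text and need not hold elsewhere in the alignment, where gaps could shift it. The helper names and the toy sequences are hypothetical.

```python
from collections import Counter

# Assumption: offset inferred from the quoted pairs (183 -> 205, 190 -> 212, ...).
OFFSET = 22


def to_alignment_column(reference_position, offset=OFFSET):
    """Map a reference residue number to a 1-based alignment column."""
    return reference_position + offset


# Reference positions listed in the text, with 225-228 expanded.
reference_sites = [183, 190, 193, 194, 216, 221, 222, 225, 226, 227, 228]
alignment_columns = [to_alignment_column(p) for p in reference_sites]


def residue_counts(sequences, column):
    """Tally the residues found at a 1-based alignment column."""
    return Counter(seq[column - 1] for seq in sequences)


# Toy clade of three short aligned sequences (not data from the paper).
toy_clade = ["AASAA", "AASAA", "AAPAA"]
toy_counts = residue_counts(toy_clade, 3)
```

Tallying one column per clade in this way gives the raw counts behind a statement such as "the majority residue at site 221 is S rather than P".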
Since the amino acids K and R are positively charged but E is negatively charged, the receptor specificity associated with residue 216 could consequently be altered in the avian H5N1 population. Moreover, the cleavability of the HA molecule of avian influenza A viruses plays a major role in virulence in birds, and the amino acid sequence at the HA cleavage site, PQRERRRKKR/G, is considered the most important pattern (Hatta et al. 2001). Between sites 352 and 355, we also identified this amino acid profile, PQRERRRKKR, in the sequence logo of p5.3, in which almost all 132 HA sequences belong to H5N1, while only a few H5N1 isolates are grouped in the other proteotypes p5.1, p5.2, and p5.4. Thus, this pattern seems to appear only in H5N1. In brief, Phylo-mLogo can help users compare and visualize changes in polymorphisms and indel events across different clades or subtypes of a large-number sequence alignment, so that they can propose possible evolutionary and functional mechanisms and examine their hypotheses further.

Example 2: 207 human influenza H3N2 isolates collected from New York